Executive Summary

The purpose of this report is to understand which factors contribute to the success of a movie. The analysis is split into three parts - descriptive, inferential and predictive. The first section analyses the data available. The second section identifies variables that are correlated with the gross revenue of a movie. The final section builds a predictive model based on the chosen variables; this model is evaluated against a held-out test set to see how accurately it predicts the gross revenue of a movie.

From our analyses, we found that:

However, this dataset has many limitations, including a limited number of variables and a clear bias towards English-language movies and movies from the US over other languages and countries. The findings drawn from this analysis therefore need to be verified against a more extensive dataset. Acquiring additional data points for movies from countries other than the US, as well as adding other variables likely to impact gross revenue, such as the profit of a movie, would help to improve the accuracy of our predictive models.

Introduction

With the advent of big data and advanced predictive analytics, producers increasingly try to determine beforehand what will make a movie succeed or fail. Rather than relying on intuition and guesswork, producers such as Netflix analyse historical data from the movie industry to increase their success rates through the use of data analytics. For Netflix, the award-winning drama ‘House of Cards’ was one such success story: descriptive, prescriptive and predictive analytics were crucial in developing the drama and gauging what content viewers were interested in.

Similarly, a question can be posed: can one predict the success of a movie before it is released in cinemas? Our key objective for this report is to determine what makes a movie popular, as measured by gross revenue. Descriptive and inferential analyses examine which of the variables available in the dataset affect user ratings and gross revenue, followed by an attempt to develop a predictive model. This model will try to predict which movies will be successful based on the relevant variables identified. 10% of the original data has been set aside as a test dataset, and the model will be evaluated on its accuracy in predicting gross revenue using Mean Squared Error and other success statistics.

Data

The dataset used in this report was downloaded from Kaggle, a website for data science competitions. It was released on August 22, 2016 and was scraped from three websites using Scrapy, a Python scraping library. The first website, the-numbers.com, provides movie industry data. It was used to scrape 5000 movie names along with relevant data such as budget and gross domestic revenue. No pattern could be identified as to how the movie names were chosen, so the data is assumed to represent a random sample. The scraped movie names were then matched with imdb.com - a popular resource for movie and TV-show ratings and celebrity content - in order to get movie scores, direct links to movie pages, and other relevant features. All the movie and actor names were later aggregated and used to extract the number of Facebook likes on their respective official Facebook pages. Finally, a face detection algorithm was applied to all movie posters to extract the number of faces in each poster. The dataset contains 5043 movies with 28 variables spanning 100 years and 66 countries.

Although imdb.com covers movies from many different countries, the website is only offered in English, which implies certain limitations. The movies included in the dataset are skewed towards an English-speaking audience. This means that movies from North America, the UK, and other English-speaking countries are overrepresented. The dataset also includes more movies with release dates after 1980. In fact, multiple release years prior to 1980 include as few as one movie.

Despite these limitations, the dataset provides a solid base for in-depth analysis. As a matter of fact, there exist few movie datasets that include such variety. This will allow the analysis to reveal interesting insights and answer the report’s leading question. The dataset can be retrieved here.

The following variables are included in the dataset:

Variable Name Explanation
actor_1_facebook_likes Number of likes on main actor’s Facebook page.
actor_1_name Main actor’s name.
actor_2_facebook_likes Number of likes on first supporting actor’s Facebook page.
actor_2_name First supporting actor’s name.
actor_3_facebook_likes Number of likes on second supporting actor’s Facebook page.
actor_3_name Second supporting actor’s name.
aspect_ratio Aspect ratio the movie was shot in.
budget Budget in USD. All budgets are estimates based on press reports.
cast_total_facebook_likes Total number of likes of all actor Facebook pages.
color Describes whether the movie was shot in color or black and white.
content_rating Rating of suitability of movie for audience.
country Country the movie was shot in. If shot in multiple countries the first IMDB entry was chosen. Throughout the analysis this variable serves as proxy for the country of origin.
director_facebook_likes Number of likes on director’s Facebook page.
director_name Name of director.
duration Movie length in minutes.
facenumber_in_poster Number of faces on movie poster.
genres Movie genre.
gross Latest domestic gross revenue reported on the-numbers.com, in USD.
imdb_score Score voted by IMDB users, from 1 to 10 (highest).
language Language in which movie was shot.
movie_facebook_likes Number of likes on movie’s official Facebook page.
movie_imdb_link Link to movie page.
movie_title Movie name.
num_critic_for_reviews Number of critics that wrote a review.
num_user_for_reviews Number of IMDB users that wrote a review.
num_voted_users Number of IMDB users that rated the movie.
plot_keywords Keywords describing the movie plot.
title_year Movie release year.

Data Cleansing

An initial review of our dataset identified a few key issues to be addressed before progressing with the analysis:

  • The existence of missing values in key variables
  • The inclusion of columns unnecessary to our analysis
  • The inclusion of erroneous special characters, which may be an artefact of text format conversion
  • The inclusion of leading and trailing white spaces.

The below table shows the columns in the initial dataset that contain missing values:

Variable Name Number of missing values
actor_1_facebook_likes 7
actor_2_facebook_likes 13
actor_3_facebook_likes 23
aspect_ratio 329
budget 492
director_facebook_likes 104
duration 15
facenumber_in_poster 13
gross 884
num_critic_for_reviews 50
num_user_for_reviews 21
title_year 108

To resolve these issues, the following cleansing process is applied:

  • All rows where gross has a missing value are removed, as there are a high number and this variable is expected to be one of the output variables.
  • All rows where budget has a missing value are removed, as there are a high number and this variable is expected to be key to the analysis.
  • After the above processing, there are no remaining rows where title_year has a missing value.
  • The aspect_ratio column is removed, as it has a high number of missing values and is not expected to be key to the analysis.
  • The movie_imdb_link column is removed, as it is not expected to be key to the analysis.

  • After this step there are only 26 remaining missing values within the dataframe. These are replaced with the mean value of their respective columns. Although mean imputation has drawbacks in terms of accuracy, it allows us to retain rows with otherwise valid information without systematically shifting column means.

  • Unwanted strings “” as well as leading and trailing white spaces are removed from the movie title column.

Although some of these steps reduce the sample size for analysis, the rows removed would either cause later analysis to fail, cause the dataset to be inconsistent across various pieces of analysis, or produce misleading results.
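The cleansing steps above can be sketched in Python with pandas. This is an illustrative sketch on a tiny made-up frame (the report's own cleaning code is not shown); column names follow the dataset.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the raw Kaggle data (made-up values).
raw = pd.DataFrame({
    "movie_title": ["A ", " B", "C", "D"],
    "gross":  [1e6, np.nan, 3e6, 4e6],
    "budget": [2e6, 1e6, np.nan, 5e6],
    "duration": [100, 90, np.nan, 120],
    "aspect_ratio": [1.85, np.nan, 2.35, 1.85],
})

# 1. Drop rows missing gross or budget (too many to impute safely).
clean = raw.dropna(subset=["gross", "budget"])

# 2. Drop columns judged unnecessary for the analysis.
clean = clean.drop(columns=["aspect_ratio"])

# 3. Mean-impute the few remaining numeric gaps.
num_cols = clean.select_dtypes("number").columns
clean[num_cols] = clean[num_cols].fillna(clean[num_cols].mean())

# 4. Strip leading/trailing whitespace from titles.
clean["movie_title"] = clean["movie_title"].str.strip()

print(clean.isna().sum().sum())  # 0: no missing values remain
```
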

This cleansed dataset (of 3789 rows) contains no missing values and is used for the remainder of the analysis. The head of the data frame is printed here:

1 2 3
color Color Color Color
director_name James Cameron Gore Verbinski Sam Mendes
num_critic_for_reviews 723 302 602
duration 178 169 148
director_facebook_likes 0 563 0
actor_3_facebook_likes 855 1000 161
actor_2_name Joel David Moore Orlando Bloom Rory Kinnear
actor_1_facebook_likes 1000 40000 11000
gross 760505847 309404152 200074175
genres Action|Adventure|Fantasy|Sci-Fi Action|Adventure|Fantasy Action|Adventure|Thriller
actor_1_name CCH Pounder Johnny Depp Christoph Waltz
movie_title Avatar Pirates of the Caribbean: At World’s End Spectre
num_voted_users 886204 471220 275868
cast_total_facebook_likes 4834 48350 11700
actor_3_name Wes Studi Jack Davenport Stephanie Sigman
facenumber_in_poster 0 0 1
plot_keywords avatar|future|marine|native|paraplegic goddess|marriage ceremony|marriage proposal|pirate|singapore bomb|espionage|sequel|spy|terrorist
num_user_for_reviews 3054 1238 994
language English English English
country USA USA UK
content_rating PG-13 PG-13 PG-13
budget 237000000 300000000 245000000
title_year 2009 2007 2015
actor_2_facebook_likes 936 5000 393
imdb_score 7.9 7.1 6.8
movie_facebook_likes 33000 0 85000

10% of the cleansed dataset (377 rows) is then set aside as the “test” dataset, leaving the remaining 90% (3789 rows) as the training dataset. The training dataset alone is used for all descriptive, inferential and predictive analysis, including model building. The test dataset is used at a later stage to check the accuracy of the predictive model.
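The hold-out split and the Mean Squared Error metric can be sketched as follows. The toy frame and the "predict the training mean" baseline are illustrative assumptions, not the report's actual model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy cleansed dataset; in the report this is the cleaned frame.
df = pd.DataFrame({"budget": rng.uniform(1e6, 1e8, 200)})
df["gross"] = 1.5 * df["budget"] + rng.normal(0, 1e7, 200)

# Hold out 10% as a test set; the remaining 90% is used for all analysis.
test = df.sample(frac=0.10, random_state=42)
train = df.drop(test.index)

# Model accuracy on the held-out set is summarised by Mean Squared Error:
def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# e.g. a naive "predict the training mean" baseline:
baseline = mse(test["gross"], np.full(len(test), train["gross"].mean()))
print(len(train), len(test))
```
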

Theory

The exact factors that determine the gross revenue of a movie are not well known, which can be seen by the high variance of gross revenues of historical movie releases, even with respect to their budgets (explored further below). There are multiple mechanisms that determine the popularity, and the related gross revenue, of a movie.

If it is assumed that movie-makers (including producers, investors and executives) are well informed and profit-maximising, then one of the key predictors of movie revenue should be the budget. The mechanism for this is that a movie-maker should continue to invest money until the expected risk-adjusted returns are below the additional cost, so an increase in budget should lead to a corresponding increase in returns (gross revenue). However, not all movie-makers are necessarily profit maximising. Often there are other incentives to create movies such as recognition, prestigious awards (e.g. the Academy Awards), or an artistic mandate.

The gross revenue of a movie is also expected to be driven by its internal “quality”; this is of course difficult to measure but proxies of critics and imdb scores and reviews can be used for quantitative analysis. This should be particularly relevant in the “information age” where the “quality” of a movie, as measured by IMDB scores or critics reviews, can be shared amongst broad online and offline social networks, and can have a strong network effect of either increasing or decreasing viewership of a movie.

The popularity of a movie, in terms of ticket sales and therefore gross revenue, in theory, should also be affected by the popularity of the cast and director, as “celebrity” culture would lead fans of those individuals to watch the movies that they act or direct in, and, furthermore, to promote them within their social circles. Therefore, it is expected that there will be a positive correlation in the data between cast and director facebook likes, and the gross revenue.

As both movie-watching populations and ticket prices have been growing over time (see http://www.boxofficemojo.com/yearly/), it is expected that gross revenues of movies would also display growth over time, although this may not necessarily be the case as there are many other factors impacting the industry such as a growing number of theatrical releases for consumers to choose from (e.g. approx 4000 in 1920, and 165,329 in 2010 (data from http://www.imdb.com/year/)), and the rise in online streaming services and movie piracy.

It is also expected that PG-13 ratings attract a higher audience as they are technically open for all audiences (guidance is only recommended, not enforced, in the US) as per http://www.mpaa.org/film-ratings/, and as there are fewer content restrictions than for a G/PG movie, their content generally targets a wider range of audiences.

Other key factors that influence the popularity of a movie are internal, such as the language, genre, and key elements of the plot. These factors have multiple potential mechanisms to impact the gross revenue: they directly impact the audience for the movie both in terms of size and demographics, they may also act as a proxy for the quality of movie, as there are different standards and expectations across genres for example, and they may be more directly linked to popular trends which may cause people to go to the cinema.

These factors are explored within this report, both individually and by considering their interrelationships, for the purpose of assessing their inclusion into a predictive model for gross revenue.

Analysis

Descriptive Data Analysis

An initial correlation plot of all numerical variables displays the relationships between the variables in the dataset:

A large proportion of the numerical variables available relate to the Facebook likes of various entities. In theory, these variables are likely to be relevant to the performance of movies only after Facebook became widely used for marketing and by celebrities. The correlation matrices below show the interrelationships of variables following the start of the Facebook era (post 2005), and following the era in which Facebook was widely used for marketing and by celebrities (post 2010):
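Computing era-specific correlation matrices is a matter of filtering on title_year; the sketch below uses simulated columns (the cut-off years 2005 and 2010 follow the text, everything else is made up).

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
# Simulated stand-ins for a few of the dataset's numerical columns.
df = pd.DataFrame({
    "title_year": rng.integers(1990, 2017, 300),
    "budget": rng.uniform(1e6, 1e8, 300),
    "movie_facebook_likes": rng.integers(0, 100000, 300),
})
df["gross"] = 1.2 * df["budget"] + rng.normal(0, 2e7, 300)

# Full-sample, post-2005, and post-2010 correlation matrices.
corr_all  = df.corr(numeric_only=True)
corr_2005 = df[df["title_year"] > 2005].corr(numeric_only=True)
corr_2010 = df[df["title_year"] > 2010].corr(numeric_only=True)
print(corr_all.loc["budget", "gross"].round(2))
```
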

These relationships are explored throughout this report, in particular the relationship with the target output variable, gross revenue. Some key statistics for the output variable: the maximum is $760,505,847, the minimum is $162, and the standard deviation is $68,724,881. This high standard deviation (and therefore variance) is expected, as per the theory. The distribution and log distribution can be seen below; the first is positively skewed and the second negatively skewed.
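The summary statistics and the skew of the raw versus log-transformed revenue can be reproduced in miniature. The log-normal toy series below merely mimics a heavy right tail; the dollar figures quoted in the text come from the actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Log-normal toy gross figures: heavy positive skew, like the real data.
gross = pd.Series(np.exp(rng.normal(16, 2, 1000)))

summary = {
    "max": gross.max(),
    "min": gross.min(),
    "sd":  gross.std(),
    "skew": gross.skew(),              # positive: long right tail
    "log_skew": np.log(gross).skew(),  # near zero here; negative in the report's data
}
print({k: round(v, 2) for k, v in summary.items()})
```
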

The descriptive analysis below focuses on individual variables and relationships in turn, drawing both from the above correlations, and from relationships between numerical and non-numerical variables.

English and Non-English Language Movie Comparison

The following chart illustrates the changes in average gross revenue from 1920 to 2016. We can observe large fluctuations between 1920 and 1968, which can be attributed to the lack of data during this period: the dataset contains only 45 records from 1920 to 1968, whereas there are 3744 records of movies screened after 1968.
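Averaging gross by release year, as the chart does, is a one-line groupby. The sketch below (on made-up rows) also carries the per-year record count, which is what explains the instability of the pre-1968 averages.

```python
import pandas as pd

# Made-up records; the report's chart averages gross by release year.
df = pd.DataFrame({
    "title_year": [1950, 1950, 2000, 2000, 2000, 2016],
    "gross":      [1e6,  3e6,  5e7,  7e7,  9e7,  8e7],
})
# Mean gross and record count per year; sparse years give unstable means.
yearly = df.groupby("title_year")["gross"].agg(["mean", "count"])
print(yearly)
```
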

The two charts below compare the gross revenues achieved by English and non-English movies across the years.

For English movies, the huge fluctuations in early years are due to the small amount of movie data before 1968. After 1968, the average gross revenue grew steadily to around US$80 million by 2016.

For non-English movies, we can observe fluctuations across the entire period (from 1920 to 2016). This is likely caused by the relative lack of non-English movie data: there are only 182 non-English movies in the dataset, compared to 3607 English movies. In recent years (after 1990), the average gross revenue of non-English movies ranges from below US$10 thousand to over US$100 million.

Next, we separate the movie data into two alternative groups: USA and non-USA. Comparing the charts above and below, we can see that USA movies exhibit a similar pattern to English movies. The number of USA movies clearly dominates this dataset.

IMDB Score relationship with gross revenue

This subsection inspects the effect of IMDB score on gross revenue. From the two plots below, it can be observed that gross revenue is positively correlated with IMDB score for all four groups (i.e. USA, non-USA, English, and non-English).

From the first plot, we can see that both English and non-English movies show a positive correlation between gross revenue and IMDB score. Note that non-English movies generally have much lower gross revenues and that the y-axis of this plot is on a logarithmic scale. English movies show a stronger correlation than non-English movies (even though non-English movies appear to have a steeper slope). This is as expected, since most (if not all) IMDB users are English speakers: a favourable IMDB score reflects higher popularity amongst English-speaking audiences, whereas non-English movies might not be targeted at English-speaking populations, so a higher IMDB score might not reflect their true popularity.

We can observe from the second plot that there is a stronger correlation between gross revenue and IMDB score for USA movies. This is expected because IMDB is a US-based online service and its users are predominantly from the USA. A highly popular USA movie is expected to garner more favourable reviews from USA users.

Country

In this section, we look at how many movies each country produces, and whether the country of origin influences the revenue that a movie generates.

Then, we check if there is any relationship between the gross revenue and IMDB score for countries with more than 10 movies.

Looking at the barplot, we observe that the number of movies originating in the USA is almost 10 times that of the UK. Outside North America, the largest number of movies originate from Europe. At the other end, fewer than 20 movies in our dataset originate from Asia.

Next, we look at the gross revenues for the countries with 10 or more movies. The boxplots above indicate that the movies originating from these countries have a mean revenue of 1 million. All USA releases in our dataset generated more than 10 million, and at least 50% of movies from Australia, UK, Japan and Germany generated revenues of over 10 million. On the other hand, no movies from India, China and Italy in our dataset generated more than 10 million.

Lastly, when we look at the scatterplots, we see that UK & US movies seem to show some symmetry about ‘y = x’, with no immediately obvious correlation, but the other countries have too few movies to determine whether there is any relationship between gross revenue and IMDB score.

Budget

For movie budget, IMDB notes that the reported figures on the website are the ‘negative costs’ of a movie: purely the production costs, excluding advertising and promotion costs. It also remarks that the budget figures are not very accurate and may only be ballpark estimates, as they can be very difficult to calculate and reported budgets may increase over time.

From the above histogram of movie counts across different budgets, we can see that most movies, around 85%, fall in the range of 1 million to 100 million. A negative skew in the data can be observed.

The boxplots above show the budgets of films in the top 13 countries - those with 10 or more movies in the dataset; the rest have been omitted. From this, we can see that the average film budget is similar across countries, in a ballpark range of 10 million to 100 million. Mexico and Italy appear to have a lower average film budget than the other countries.

Genres

In this section we explore the genres of the movies in our dataset and some summary statistics for them. Note that each movie may belong to more than one genre; in this section such a movie is counted once for each genre it belongs to. In the following graphs, we only present the most popular genres; genres with fewer than 10 movies have been omitted.
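Because genres is a pipe-separated string, counting a movie once per genre amounts to splitting and "exploding" the column. A sketch on three made-up rows:

```python
import pandas as pd

# Genres are pipe-separated; a movie counts once per genre it belongs to.
df = pd.DataFrame({
    "movie_title": ["Avatar", "Up", "Se7en"],
    "genres": ["Action|Adventure|Sci-Fi", "Animation|Adventure", "Thriller"],
    "gross": [760e6, 293e6, 100e6],
})
by_genre = (
    df.assign(genre=df["genres"].str.split("|"))
      .explode("genre")
)
counts = by_genre["genre"].value_counts()          # movies per genre
mean_gross = by_genre.groupby("genre")["gross"].mean()  # average gross per genre
print(counts["Adventure"])  # 2: appears in two movies
```
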

The first graph shows the number of movies per genre. Dramas and comedies are by far the most popular genres, followed by thriller, action and romance movies.

Next, we create a boxplot based on genre to observe how gross revenue is distributed across genres. We observe great differences in the distribution of gross revenue between genres, with Animation movies grossing the most and Documentaries the least on average.

Finally, we observe the differences between genres based not only on their gross revenue but also on their IMDB score.

Plot Keywords

In this section, we perform a similar analysis to the above, but based on plot keywords. Again, each movie may have more than one keyword describing its plot; such a movie is counted once for each of its keywords. To simplify things, we only look at the 20 most frequently used keywords, as the vast majority of possible keywords appear only a few times in the whole dataset and do not provide useful insights. As we see, the most frequently used keywords are love, friend, murder and death, followed by some less common ones.
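The keyword tally works the same way as the genre count; a minimal sketch using a Counter over the pipe-separated plot_keywords field (made-up rows):

```python
from collections import Counter

# Plot keywords are pipe-separated; tally and keep the 20 most common.
plot_keywords = [
    "love|friend|murder", "love|death", "murder|death|love", "friend",
]
counts = Counter(kw for row in plot_keywords for kw in row.split("|"))
top = counts.most_common(20)
print(top[0])
```
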

Similarly to the genre analysis, we create a boxplot graph based on the keywords to observe how gross revenue is distributed among the most frequently used plot keywords. We observe that the differences in the distribution of gross revenue among keywords are not that large in comparison with those based on genres. Interestingly, however, it seems that the most frequently used keywords are not the ones that bring the greatest gross revenue, while some less frequently used keywords appear to bring the greatest gross revenue on average.

Again, we observe the differences between plot keywords based not only on their gross revenue but also on their IMDB score.

Gross revenue over time

Whereas in the past, movies were a principal source of entertainment, there are today several other sources to consider. While they are not consumed in the same way, they can be considered serious competitors to the classic movie. Some of these sources are high budget TV-Shows, video games, and user created content to name a few. It is therefore of interest to analyse the development of gross revenue over time.

The above plot shows gross revenue against release year from 1920 to 2016. Overall the plot shows a slight upward trend. However, there are two sections with distinct behaviours. The first section includes movies from 1920 to 1972 and shows a large variance in mean gross revenue between years. The second section includes movies from 1972 onwards and shows a smaller variance in mean gross revenue between years. In addition, the plot has more outliers with low revenues than with high revenues. Several release years, however, include fewer than ten movies and therefore cannot be considered representative of a year’s gross revenue. To get a more accurate picture, the following two plots remove years with fewer than ten entries.

The plots now show a different overall picture. Compared to the first plot, they cannot be visually divided into two distinct sections. In fact, the trendline on the second plot shows that revenues are on a slight downward trend.

The above plot shows the development of gross revenue since the widespread popularity of Facebook in 2010. There is an upward trend, a different pattern compared to the plot displaying all years with more than ten entries.

Directors

Before looking into the relationship with the key output variable, gross revenue, we look into some descriptive statistics of the director data in the sample. There are 1753 distinct directors represented; the density of movies per director is shown in the figure on the right. This is clearly positively skewed, with many directors having only one movie in the dataset and only a few having significantly more. In fact, approximately 58.64% of these directors have only one movie in this sample, leaving 41.36% with more than one.
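Movies per director and the share of one-movie directors fall out of a value count; a sketch on made-up names:

```python
import pandas as pd

# Movies per director; most directors appear only once in the sample.
directors = pd.Series(
    ["Nolan", "Nolan", "Nolan", "Bigelow", "Gerwig", "Jenkins"]
)
per_director = directors.value_counts()
share_single = (per_director == 1).mean()  # fraction with exactly one movie
print(per_director.head(), round(share_single, 2))
```
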

An initial look at the relationship between the number of movies that the director has in the dataset, and the gross revenue of those movies does not show a clearly defined correlation, which can be seen in the figure on the left.

To check against one potential confounder for the relationship that number of movies per director has with gross, the correlation between movies per director and budget is calculated.

Pearson’s Product-Moment Correlation test for number of movies per director against budget, correlation = 0.078
Test statistic df P value Alternative hypothesis
4.834 3787 0.000001388 * * * two.sided

Even though this correlation was found to be statistically significant, with a p-value of 0.0000014, it is quite small (0.078), so the analysis of movies per director against gross revenue can continue without budget as an obvious confounder.
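The confounder check is a Pearson correlation test; a sketch with scipy.stats.pearsonr on simulated data (the report's r = 0.078 and its p-value come from the real sample, not this sketch).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated stand-ins: a weak relationship between the two quantities.
movies_per_director = rng.integers(1, 20, 500).astype(float)
budget = 1e6 * movies_per_director + rng.normal(0, 1e8, 500)

# A small r can still be statistically significant with a large n,
# without being a practical confounder.
r, p = stats.pearsonr(movies_per_director, budget)
print(round(r, 3), p)
```
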

Focusing on the relationship between gross revenue and number of movies per director with the additional dimension of genres, we can see from the figure below there is an approximate pattern of a slight positive correlation up to 10 movies by director (also note the y axis is of log scale), and then a less predictable path for those with directors with higher numbers of movies.

In terms of the impact that this pattern may have on our dataset, only approximately 1.2% of directors have more than ten movies, which constitutes approximately 7.89% of the movies in our sample, so the positive trend at the low end of the scale may be more useful in prediction than the behaviour at the higher end.

Inferential Data Analysis

Language and Country

From the descriptive analysis section, we learned that the English movies (in this dataset) are generally more popular than non-English movies. In this section, we shall perform hypothesis testing to check if there is any statistically significant difference in the average gross revenues achieved by English and non-English movies.

From the above boxplot, we can observe a great difference between the revenues achieved by English and non-English movies. More than 75% of the non-English movies achieved gross revenue of US$10 million or lower. In contrast, more than half of the English movies achieved revenues of over US$30 million.

The below table tabulates the t-test results. The results show that the average gross revenues of English movies are significantly higher than non-English movies.

T-Test of means - Gross Revenue for English and Non-English language movies
Test statistic df P value Alternative hypothesis
29.7 824.5 1.121e-133 * * * greater

In addition, the gross revenues of English movies vary greatly (wider range of gross revenues), compared to non-English movies. The below variance test output confirms this.

F-Test of variance - Gross Revenue for English and Non-English language movies
Test statistic num df denom df P value Alternative hypothesis
24.01 3606 181 0 * * * greater
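Both tests above can be sketched with scipy. The samples below are simulated with the report's group sizes (3607 English, 182 non-English) but made-up means and variances; scipy has no direct F-test helper, so the classic variance-ratio form is written out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Simulated revenues: "English" group larger, with higher mean and variance.
eng     = rng.normal(5e7, 7e7, 3607).clip(min=1e3)
non_eng = rng.normal(1e7, 2e7, 182).clip(min=1e3)

# One-sided Welch t-test: H1 = English mean revenue is greater.
t, p_t = stats.ttest_ind(eng, non_eng, equal_var=False, alternative="greater")

# F-test of equal variances: ratio of sample variances vs the F distribution.
f = eng.var(ddof=1) / non_eng.var(ddof=1)
p_f = stats.f.sf(f, len(eng) - 1, len(non_eng) - 1)
print(round(t, 1), p_t, round(f, 1), p_f)
```
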

In summary, we are confident to say that the English movies in this dataset are generally more popular than non-English movies.

Next, we will focus on the differences between USA and non-USA movies. The boxplot below shows a similar pattern. Movies shot in the USA have achieved more than US$50 million in average gross revenue whereas the average gross revenue of non-USA movies is around US$25 million.

The below table tabulates the t-test results. The results show that the average gross revenues of USA movies are significantly higher than non-USA movies.

T-Test of means - Gross Revenue for USA and Non-USA movies
Test statistic df P value Alternative hypothesis
14.25 1864 3.856e-44 * * * greater

In addition, the gross revenues of USA movies vary greatly (wider range of gross revenues), compared to non-USA movies. The below variance test output confirms this.

F-Test of variance - Gross Revenue for USA and Non-USA movies
Test statistic num df denom df P value Alternative hypothesis
2.27 2992 795 0 * * * greater

In summary, we are confident to say that the USA movies in this dataset are generally more popular than non-USA movies.

The huge success of English-language and USA movies is not surprising. With the growing soft power of the USA and UK in the 20th century, more and more worldwide audiences started to watch English movies. Some highly successful Hollywood movies are even dubbed into other languages (e.g. Chinese, Japanese, Hindi) for non-English-speaking movie-goers. By contrast, non-English movies tend to be targeted primarily at their respective local markets.

As a result, many more USA and English movies are made, and they are generally more successful (in terms of gross revenue) than non-English movies.

Content Rating

In this section, we will look at the content rating of movies and check whether it has an impact on the gross revenue generated by movies.

R vs PG-13
Test statistic df P value Alternative hypothesis
-13.93 1733 6.673e-42 * * * two.sided

PG-13 vs PG
Test statistic df P value Alternative hypothesis
-1.824 1116 0.06843 two.sided

R vs PG
Test statistic df P value Alternative hypothesis
-12.15 660.1 8.728e-31 * * * two.sided

We conducted t-tests to test whether the mean revenue across the top three ratings - PG, PG-13 and R - is the same.

A glance at the boxplots: R rated movies tend to, on average, generate lower revenues than PG or PG-13 rated movies. Most PG and PG-13 rated movies make between 10 and 100 million, while almost 50% of R rated movies make less than 10 million.

T-test results: There is no significant difference in mean gross revenue between PG and PG-13 movies, whereas p-values below 5% support the alternative hypothesis that the difference in mean gross revenue between R and PG/PG-13 movies is not equal to 0.
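The three pairwise comparisons can be generated in a loop; a sketch on simulated per-rating samples (group means are chosen to echo the pattern in the boxplots, not taken from the real data):

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(5)
# Simulated gross samples per rating; "R" drawn with a lower mean.
samples = {
    "PG":    rng.normal(6e7, 3e7, 300),
    "PG-13": rng.normal(6e7, 3e7, 600),
    "R":     rng.normal(3e7, 3e7, 900),
}
# Welch two-sided t-test for every pair of ratings.
results = {}
for a, b in combinations(samples, 2):
    t, p = stats.ttest_ind(samples[a], samples[b], equal_var=False)
    results[(a, b)] = (round(t, 2), p)
print(results)
```
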

Budget

Intuitively, the higher the budget of a movie, the better it should be, and hence the more people it should attract, resulting in a higher gross revenue. We would like to investigate whether the gross revenue of a movie is affected by its budget.

Pearson’s Product-Moment Correlation Test for Gross Revenue against Budget, correlation = 0.223
Test statistic df P value Alternative hypothesis
14.09 3787 5.56e-44 * * * two.sided

The positive covariance (approximately 1.65 × 10^15; its magnitude depends on the units, so only its sign is informative) suggests an upward trend: as budget increases, revenue increases. The Pearson correlation between budget and gross is 0.22, which suggests a positive but not very strong correlation between the budget of a movie and its gross revenue. At the 95% confidence level, the p-value is significant (5.56e-44), and hence we can reject the null hypothesis that the true correlation is equal to 0.

If we subset the data by country, we still do not see a strong correlation between budget and gross revenue, and no linear relationship between the two variables is apparent. However, a point to note is the limited number of movies from other countries, which limits our ability to assess whether the investigated relationship holds across countries.
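Subsetting the correlation by country is a groupby; a sketch on simulated data (the country proportions and the weak budget-gross relationship are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "country": rng.choice(["USA", "UK", "France"], 600, p=[0.8, 0.15, 0.05]),
    "budget": rng.uniform(1e6, 1e8, 600),
})
df["gross"] = 0.5 * df["budget"] + rng.normal(0, 5e7, 600)

# Budget-gross correlation within each country; small groups give noisy estimates.
corrs = {c: g["budget"].corr(g["gross"]) for c, g in df.groupby("country")}
print({c: round(r, 2) for c, r in corrs.items()})
```
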

Genres

We will now look in more detail at how the gross revenue of movies differs by genre. Because genre is a nominal variable, the expected gross revenue for movies of each genre equals the average gross revenue of the observed movies of that particular genre. The following diagram shows which genres have the highest and lowest average gross revenue, in order.

We see that Documentaries are by far the lowest-grossing genre, with an average estimated gross revenue of US$ 13.38 million, while the highest-grossing appear to be Animation movies, with an average gross revenue of US$ 107.66 million, followed by Adventure, Family, Fantasy and Action movies. The difference between the highest and lowest average gross revenues by genre is US$ 94.28 million. The remaining genres appear to have roughly the same expected gross revenue. To find out whether these differences in expected gross revenue are significant, we will perform T-tests between the average gross revenues of movies of different genres.

T-Test for Gross Revenue between Animation and Documentary movies
Test statistic df P value Alternative hypothesis
12.35 253.6 0.0000000000000000000000000009275 * * * two.sided
T-Test for Gross Revenue between Animation and Action movies
Test statistic df P value Alternative hypothesis
4.248 270.8 0.00002965 * * * two.sided
T-Test for Gross Revenue between Animation and Adventure movies
Test statistic df P value Alternative hypothesis
1.31 322.3 0.1913 two.sided

Based on the t-tests, we observe that the differences in average gross revenue between some genres are significant; thus, including genre variables in our linear regression models might improve performance.
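The per-genre expected values used above are simply group means. A minimal sketch over (genre, revenue) pairs; the records below are hypothetical, not taken from the dataset.

```python
from collections import defaultdict

def mean_revenue_by_genre(records):
    """Average gross revenue per genre from (genre, revenue) pairs,
    i.e. the expected value used for a nominal variable."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for genre, revenue in records:
        sums[genre] += revenue
        counts[genre] += 1
    return {g: sums[g] / counts[g] for g in sums}

# Hypothetical (genre, gross revenue in US$) records
movies = [("Animation", 110e6), ("Animation", 105e6),
          ("Documentary", 15e6), ("Documentary", 12e6)]
genre_means = mean_revenue_by_genre(movies)
```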

Plot Keywords

We will perform an analysis similar to the one we did for genres, but this time on the plot keywords of the movies. Again, the keywords are categorical variables, so the expected gross revenue for each keyword equals the average gross revenue of the observed movies with that specific keyword. We will focus only on the 20 most popular keywords, as they are the ones likely to have the greatest predictive power for our future models.

In contrast to the genres, the differences in average gross revenue across keywords are not as large. Among the 20 most popular keywords, alien appears to be the highest-grossing and boy the lowest-grossing, with predicted gross revenues of US$ 69.75 million and US$ 29.15 million respectively, a difference of US$ 40.6 million. As before, we will perform a T-test between the average gross revenues of the movies with the highest- and lowest-grossing plot keywords to find out whether this difference is significant.

T-Test for Gross Revenue between alien and boy keywords
Test statistic df P value Alternative hypothesis
3.43 125.9 0.0008171 * * * two.sided

As before, based on the t-test between the keywords with the highest and lowest average gross revenue, we can conclude that the difference in predicted gross revenue is significant. However, the number of movies associated with each of the most popular keywords is small relative to the total number of observations in our dataset, so including the keyword variable in our models may not give significant improvements.

Director relationship with Gross Revenue

To analyse the impact of the director on gross revenue, this section explores the relationship between director-specific variables (in particular, how prolific the director is) and the target outcome variable, gross revenue.

The basic relationship between the two variables can be seen in the figure to the right, and seems to show an upward trend for directors who have a low number of movies in the sample (roughly fewer than 10), and a less clear trend for directors who have more movies in the sample.

Two correlation tests are performed (output below) to test two hypotheses: whether the number of movies per director is correlated with gross revenue overall, and whether it is correlated with gross revenue specifically for movies whose director has fewer than 10 movies in the sample.

Correlation Test for Gross Revenue and # of movies per director in sample, correlation = 0.269
Test statistic df P value Alternative hypothesis
17.16 3787 0.000000000000000000000000000000000000000000000000000000000000001201 * two.sided

Correlation Test for Gross Revenue and # of movies per director in sample, for directors with <10 movies, correlation = 0.323
Test statistic df P value Alternative hypothesis
19.9 3408 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000001954 * two.sided

These correlation tests indicate a slight positive correlation between the number of movies a director has in the sample and the outcome variable, gross revenue. This correlation is both stronger and more statistically significant for directors with fewer than 10 movies in the sample. The directors with 10+ movies in the sample can be seen below:

If we instead look at the categorical variable of whether or not a director has directed 10+ movies, we can see that the two distributions look slightly different (see figure on the left).

The plots seem to suggest that the movies where the director has 10+ movies in the sample have a higher mean and a lower variance of gross revenue. However, this may be largely explained by the number of observations in each sample.

To determine whether the difference is statistically significant, we run a t-test and an F-test (output below): the t-test checks whether the mean gross revenue is higher for movies with directors who have directed 10+ movies, and the F-test checks whether the variance is the same for those movies.

The data suggests that the average gross revenue is higher for movies where the director has over 10 movies in the sample, which supports the claim that prolific directors produce more popular movies (if we take gross revenue as a proxy for popularity).

Also, the opposite of the initial intuition appears to hold for variance: movies made by directors with over 10 movies in the sample have a higher variance (about 6.71e15) than the others (about 4.48e15). This may be partly explained by the much larger number of movies made by directors with fewer than 10 movies.

T-test of means - Gross revenue for movies with directors with 10+ movies against those with fewer
Test statistic df P value Alternative hypothesis
6.393 333 0.0000000002747 * * * greater
F-test of variance - Gross revenue for movies with directors with 10+ movies against those with fewer
Test statistic num df denom df P value Alternative hypothesis
1.498 298 3489 0.0000004816 * * * two.sided
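The F-test of variances above boils down to a ratio of sample variances. A minimal illustrative sketch (the function name is ours, and the final line simply re-uses the two reported group variances):

```python
from statistics import variance

def f_statistic(sample_a, sample_b):
    """F statistic for comparing two sample variances: the ratio
    var(a) / var(b), referred to an F distribution on
    (len(a) - 1, len(b) - 1) degrees of freedom."""
    return variance(sample_a) / variance(sample_b)

# The two reported group variances reproduce the F statistic directly
f_stat = 6.707e15 / 4.478e15  # close to the 1.498 in the output above
```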

Predictive Data Analysis

Model Building and Justification

Multiple models were created to predict gross revenue from input variables chosen on the basis of the descriptive and inferential analysis above. Two models, the Full model and the English model, were built using all data available, to determine which was the most effective when tested on the test set, and one “known” model was built using only variables that would normally be available before a movie is released.

The results shown are for models trained on all data in the training set, and also for models trained only on movies released post-2010. This is because some key variables in the dataset (e.g. the Facebook like variables) only became relevant once Facebook was in widespread use, which changes the values of the coefficients.

First full model

Linear Regression results
Dependent variable:
Gross revenue
For all years From 2010 onwards
(1) (2)
budget 0.032*** 0.343***
(0.007) (0.041)
num_voted_users 207.206*** 231.390***
(9.089) (21.257)
PG_13PG-13 24,089,934.000*** 13,180,735.000***
(1,773,093.000) (3,584,222.000)
duration 370,653.300*** 254,487.700**
(41,614.460) (111,171.600)
cast_total_facebook_likes 257.856*** 309.455***
(44.534) (109.553)
director_facebook_likes -1,001.776*** -2,194.554***
(285.247) (596.253)
Animation 26,821,134.000*** 39,353,802.000***
(4,530,454.000) (8,926,510.000)
Family 45,457,622.000*** 32,186,103.000***
(3,128,157.000) (6,935,375.000)
Mystery -6,326,802.000** -13,558,891.000**
(2,729,775.000) (5,657,763.000)
Drama -20,024,779.000*** -16,204,121.000***
(1,731,246.000) (3,587,128.000)
num_user_for_reviews 23,888.980*** 29,601.780***
(3,269.301) (9,181.291)
Constant -26,795,629.000*** -24,575,766.000**
(4,360,699.000) (11,055,605.000)
Observations 3,412 904
R2 0.529 0.658
Adjusted R2 0.528 0.654
Residual Std. Error 47,140,574.000 (df = 3400) 47,868,391.000 (df = 892)
F Statistic 347.417*** (df = 11; 3400) 156.282*** (df = 11; 892)
Note: p<0.1; p<0.05; p<0.01

Justification

The first full model is built using the training set. The left column is trained on all training data, and the right column on data for movies released post-2010, because some of our predictive variables (e.g. Facebook likes) have only been relevant since Facebook became widely popular for celebrities and movies.

Variables included in the model:

  • Inferential analysis showed that the budget of a movie is positively correlated with the gross revenue, which can be interpreted as movie-makers being wise enough to get a high return on investment.
  • IMDB score can be seen as a measure of the “quality” of a movie, as voted on by the public. At first look this variable is positively correlated with gross revenue, but interestingly the coefficient becomes negative once the “number of users voted” variable is included. One interpretation is that the popularity of a movie is more fully explained by the number of people reviewing it; once this is taken into account, movies with higher “quality”, as measured by IMDB score, actually do less well at the box office. This could be because people are more likely to vote on IMDB for movies they liked, and, more obviously, for movies they have seen, so the number of user votes is linked to the number of cinema tickets sold, which in turn is linked to gross revenue. With number of voted users included, the IMDB score is statistically insignificant and is therefore removed from the model.
  • Inferential analysis also showed that for some ratings (e.g. PG-13) there was some impact on gross revenue, so rating has been included in the model to include these effects.
  • Duration, when included in the model, has a statistically significant positive coefficient. This could be interpreted as movie-goers perceiving longer movies as better value for money or higher quality.
  • Cast total Facebook likes can also be seen as a potential predictor of gross revenue: in theory, the more popular the actors (measured by Facebook likes), the more people will go to see them, the more tickets are sold, and therefore the higher the revenue. The positive coefficient suggests this is reflected in the data.
  • Director Facebook likes is statistically significant when included in the model, interestingly with a negative coefficient. Again, this variable is at first glance positively correlated with gross revenue, but once the number of voted users is taken into account, the relationship is reversed, potentially for reasons similar to those above.
  • Key genres have been included in the model where they had statistically significant predictive power. There is a risk of overfitting here, so the variables included are those with a logical intuition: for example, Animation and Family films are more likely to be targeted at a younger audience, which may have a larger potential viewership, while Mystery and Drama genres are less likely to be targeting a younger audience.
  • The number of users leaving reviews has been included as it is statistically significant; it may be correlated with revenue through two possible mechanisms: reviews may encourage more people to watch the movie, or the more people see the movie, the more reviews are left.

Not included into the model:

  • Movie title, director names and actor names were not included, as there are many different values for these variables with a small number of observations (usually one) each.
  • Plot keywords were not included as there was a high number of possible values, and they were not statistically significant when included into the model.
  • Number of critic reviews is quite highly correlated with the number of user reviews (correlation coefficient 0.57), so it was decided to include only one of these variables in the model, in this case the number of user reviews.
  • When decade is included, none of the decade indicators are found to be statistically significant once the other variables are included.

  • When colour is included in the model, none of its levels are found to be statistically significant; there is also not very much variation in this variable in the sample set (seen right).
  • When either language or country is included in the model, the R-squared does go up, but only by a small amount, and it introduces a lot of additional levels to interpret, so the model loses some interpretability. For this reason, these factors have been left out.
  • The multiple actor “facebook likes” variables are mostly correlated with each other, as can be seen from the correlation matrix, so the total cast facebook likes has been chosen as the “overall” variable to explain the effects of facebook popularity of the cast.
  • The number of faces in the poster has not been included: when added, it is only marginally statistically significant, and the intuition behind this variable is unclear, so including it could lead to overfitting.
  • IMDB score as discussed above.
  • Director number of movies, although significant on its own, loses its statistical significance once the other variables are taken into account, so it is omitted from the model.

Second Model (English)

Linear Regression results
Dependent variable:
Gross revenue
For all years From 2010 onwards
(1) (2)
budget 0.059*** 0.553***
(0.009) (0.047)
movie_facebook_likes 754.494*** 847.768***
(47.208) (68.280)
cast_total_facebook_likes 350.522*** 498.954***
(51.256) (118.153)
director_facebook_likes 798.316** -728.850
(320.658) (634.531)
imdb_score 13,411,604.000*** 3,283,901.000
(985,450.200) (2,192,365.000)
PG_13PG-13 22,655,734.000*** 11,830,310.000***
(2,093,329.000) (3,938,389.000)
Action 21,145,340.000*** -920,754.700
(2,351,715.000) (4,445,700.000)
Adventure 25,281,466.000*** 5,764,488.000
(2,735,587.000) (5,717,462.000)
Animation 12,767,586.000** 39,250,943.000***
(5,221,415.000) (9,673,241.000)
Documentary -14,773,376.000** 8,205,518.000
(7,056,655.000) (10,655,490.000)
Fantasy 11,730,970.000*** -7,797,037.000
(3,007,034.000) (5,493,406.000)
Family 37,409,611.000*** 23,541,576.000***
(3,822,481.000) (7,971,228.000)
us_or_othersUSA 23,656,308.000*** 19,932,639.000***
(2,477,262.000) (4,688,316.000)
englishNon-English -28,372,127.000*** -6,792,282.000
(4,778,178.000) (9,758,842.000)
Constant -91,958,125.000*** -41,679,291.000***
(6,954,270.000) (14,708,669.000)
Observations 3,412 904
R2 0.382 0.594
Adjusted R2 0.380 0.587
Residual Std. Error 54,030,319.000 (df = 3397) 52,298,170.000 (df = 889)
F Statistic 150.020*** (df = 14; 3397) 92.750*** (df = 14; 889)
Note: p<0.1; p<0.05; p<0.01

Justification

The second model builds on the observation that English and USA movies are generally more successful than others, and uses this information as predictor variables. We will name this model the “English” model here.

Similar to the first full model, this model is built using the training set. The left column is trained on all training data, and the right column on data for movies released post-2010, for the same reason as before: some of our predictive variables (e.g. Facebook likes) have only been relevant since Facebook became widely popular for celebrities and movies.

Variables included in the model:

Some of the variables used in this model are identical to those used in the first model; the justifications for these can be found in the earlier section. The list below discusses the variables used only in the second model:

  • Two new variables (“english” and “us_or_others”) have been created to separate the movies into two distinct groups (i.e. English vs non-English, USA vs non-USA). From the inferential analysis section, we have seen a huge difference in gross revenue between English and non-English (similarly, USA and non-USA). Therefore, these two variables are expected to have a major effect in predicting gross revenue.
  • This model does not use the “number of users voted” variable. Intuitively, a higher number of voters does not necessarily translate to higher movie quality; IMDB score may reflect movie quality better. In the descriptive analysis section, we observed a slight positive correlation between gross revenue and IMDB score, so this model includes IMDB score as a measure of the quality and popularity of the movie.
  • Movie Facebook likes is also statistically significant when included in the model. It can be interpreted as a proxy for the popularity of the movie; it could also be that the more people saw the movie (reflected in gross revenue), the more people subsequently liked its page.
  • From the genre analysis section, it can be seen that Documentaries have significantly lower gross revenue, while Action, Adventure, Animation, Family and Fantasy movies are more popular than the rest. These six genres are included in the model.

Not included into the model:

We have already discussed some of the variables not included in this model (in the section for the first model); the list below describes the rationale for excluding some other variables:

  • Number of voted users is not included as explained above.
  • Number of critic reviews is not included. Similar to the reasons for excluding “number of voted users”, this metric is not considered to have a major impact on the popularity of a movie.
  • Movie duration is excluded since it should not affect the quality (and hence the popularity) of a movie.

Third Model, with only pre-known variables

Linear Regression results
Dependent variable:
Gross revenue
For all years From 2010 onwards
(1) (2)
budget 0.053*** 0.592***
(0.009) (0.045)
PG_13PG-13 26,353,129.000*** 10,560,080.000**
(2,174,109.000) (4,256,815.000)
duration 934,261.500*** 866,407.300***
(48,245.350) (126,773.400)
cast_total_facebook_likes 605.538*** 792.053***
(53.594) (125.898)
director_facebook_likes 1,357.072*** 1,362.006*
(363.913) (706.213)
Animation 41,235,635.000*** 51,638,658.000***
(5,526,847.000) (10,509,137.000)
Family 39,568,524.000*** 8,572,898.000
(3,835,844.000) (8,111,973.000)
Mystery -2,628,077.000 -10,934,946.000
(3,342,408.000) (6,706,536.000)
Drama -30,297,295.000*** -24,156,590.000***
(2,101,551.000) (4,234,060.000)
more_than_ten_movies10+ 13,305,029.000*** -6,636,179.000
(4,257,061.000) (9,760,632.000)
Constant -63,513,556.000*** -70,092,184.000***
(5,243,509.000) (12,853,456.000)
Observations 3,412 904
R2 0.291 0.517
Adjusted R2 0.289 0.512
Residual Std. Error 57,851,375.000 (df = 3401) 56,877,080.000 (df = 893)
F Statistic 139.407*** (df = 10; 3401) 95.647*** (df = 10; 893)
Note: p<0.1; p<0.05; p<0.01

Justification

To build a model that predicts the revenue of movies that have not yet been released, this model includes only variables known before the time of release. Our dataset provides the number of Facebook likes at a single point in time; to use this properly in a predictive model, we would need the number of Facebook likes before the movie was released, so we use what we currently have as a proxy. This model has lower predictive power (adjusted R-squared of 0.289), but could be more useful in a business context.

Model Performance

Having built the three models, we will now use the test set to test their performances. The models trained using all movies data in the training set are tested against the whole test set, whereas the models trained using post-2010 training data are tested against the post-2010 data in the test set.

The Root-Mean-Square Error of the models are tabulated below.

                        Model using all data   Model using post-2010 data
  Full Model            46,141,849             37,947,774
  English Model         56,649,281             43,529,184
  Known Variables Model 60,486,885             46,588,390
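The RMSE values above are computed in the standard way from actual and predicted revenues on the test set. A minimal sketch with hypothetical numbers (US$ millions here, not the actual test data):

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical revenues in US$ millions
actual = [120, 45, 60, 10]
predicted = [100, 50, 80, 30]
err = rmse(actual, predicted)  # 17.5 for these numbers
```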

For each test scenario, we included two plots:

  1. Scatter plot of Actual Revenue vs Predicted Revenue: the straight line indicates where the predicted revenue equals the actual revenue. The closer the points are to this line, the better the performance of the prediction model.
  2. Residual plot: the horizontal line indicates a residual of zero. The closer the points are to this line, the better the performance of the prediction model.

The full model is the best performing linear model (lowest RMSE). Most of the points are reasonably close to the straight line. However, the model does overestimate the popularity of certain movies, and the residuals tend to grow as the predicted value increases.

The other full model (which is trained using post-2010 data) seems to perform much better in predicting the popularity of post-2010 movies.

The English model performs worse than the full model in terms of RMSE. Although it has more difficulty identifying hugely successful movies, it is less prone to overestimating the gross revenues of less successful movies, and its residuals stay within a narrower range for most predicted values.

The English model trained using post-2010 data also performs better against the post-2010 test set. This is expected, since some of the independent variables used (e.g. movie Facebook likes) are deemed to have more predictive power for post-2010 movies.

The last model, the known variables model, leaves out independent variables that are only known after a movie's release (e.g. movie Facebook likes, IMDB score). Consequently, it is expected to be less accurate in predicting gross revenue, and as expected it is the worst-performing model: it is less accurate at identifying highly successful movies and is prone to overestimating the popularity of certain movies.

Interestingly, the accuracy of the known variables model also improves when it is trained and tested on post-2010 movie data. It seems that the independent variables used in this model predict the gross revenues of recent (post-2010) movies more accurately.

Discussion

Key Findings

Relevant Factors for Prediction of Movie Gross Revenue

This report has explored and identified some key factors that correlate with gross revenue and may be used for prediction. In this dataset, English-language/US-made movies have higher gross revenues than non-English/non-US-made movies, which is in line with the expected theory. R-rated movies appear to have lower gross revenue than PG/PG-13 rated movies; again, this is in line with the expected theory.

Budget and revenue were found to be correlated, but the correlation is not particularly strong. Although the positive correlation is in line with the expected theory, its weakness shows that investors in movie ventures take on a lot of risk, as only a small proportion of the variance in revenue can be explained by budget alone.

Gross revenue has been broadly in decline since 1980, with a recent uptick if we look more closely at the post-2010 data; in prediction, the year of release is not statistically significant. There were opposing theoretical mechanisms for the impact of the year of release on gross revenue, so this analysis supports the claim that these mechanisms roughly cancel each other out, although this may change if the increase in revenue seen since 2010 continues.

As expected, gross revenue also seems to be dependent on genre. The genres with the highest gross revenues are Animation, Adventure, Family, Fantasy and Action, and the genre with the lowest gross revenue is Documentary, whose revenue is significantly lower than that of the remaining genres, which are fairly well clustered together.

The number of movies produced by a director appears to have a positive impact on gross revenue; in particular, there is usually an increase in gross revenue if the movie has a director with more than 10 movies in the sample. The popularity of the cast and director, as measured through Facebook likes, also appears to have a positive impact on gross revenue, although the quality of this data should be explored more fully to verify and interpret these findings.

Accuracy of Predictive Models

The models created in this report had higher accuracy when using post-2010 data, which was to be expected, as key variables became relevant at that point in time. The best-performing model explained approximately 66% of the variance in gross revenue, while the model containing only factors known before the release of the movie explained only approximately 52%.

While the full model delivers a reasonably good performance in predicting the movie gross revenue, the Root-Mean-Square Error is still rather high (around US$40 million). The key factors constituting a popular movie might not be captured in this Kaggle dataset. To improve the prediction accuracy, we could potentially explore other metadata of movies, explored further below.

Limitation of Linear Models

All three linear models predicted negative gross revenues for certain movies. Clearly this does not make sense, since in the worst case the lowest possible revenue is zero (when absolutely nobody watches the movie in the cinema).

Note that all three linear models have negative constant terms. Furthermore, the independent variables that are negatively correlated with gross revenue can push the predicted gross revenue below zero. We therefore have to be careful in interpreting the prediction results.
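One simple post-processing remedy, sketched below under the assumption that predictions are held in a plain list, is to floor predictions at zero; an alternative (not pursued in this report) is to regress on log-revenue, whose back-transform is always positive.

```python
def clip_predictions(preds, floor=0.0):
    """Floor linear-model predictions at a minimum value, since
    gross revenue cannot be negative. Function name and data are
    illustrative, not part of the original analysis."""
    return [max(floor, p) for p in preds]

# Hypothetical raw model outputs, including an impossible negative value
raw = [-5_000_000.0, 12_000_000.0, 80_000_000.0]
clipped = clip_predictions(raw)
```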

Recommendations

If the outcomes of this report were used to make a movie that maximises gross revenue, that movie would be a long, animated, family movie, made in the US in the English language, rated PG-13, starring actors who are popular on Facebook, and made by a director who has a lot of Facebook likes and has already made at least 10 movies. Although the plot keyword analysis would suggest a movie about aliens seeking revenge on an island, this may be seen as too derivative of the highest-grossing movie in our sample, Avatar, and as such we would recommend a more original plot.

Next Steps

This report has identified facts and relationships for multiple variables in the dataset, namely country, language, year of release, director name, content rating and budget. Although they are used in the predictive models, further descriptive and inferential analysis of the other variables could give additional insight, in particular for the “facebook likes” variables, with further investigation into how these were gathered and how to handle cases where some or all of the cast have no Facebook page.

This report focuses on the relationship between these variables and the domestic gross revenue of the movie. Other interesting target variables could be:

  • The worldwide revenue of the movie
  • The imdb score, which can be seen as a proxy for “quality” of a movie.
  • The profit of a movie, which was not available in our dataset.

Further data collection would likely improve the predictive model. Some key variables that are likely to affect gross revenue were not present in the data, including the date of release, the marketing budget and channels, the number of theatres showing the movie on initial release, the studio, whether or not the movie is a sequel or part of a franchise, whether it was adapted from a book or video game, the age of the leading actors, and the number of movies the actors have previously starred in. From this list, the date of release in particular is expected to have a large impact on the gross revenue prediction, as cinema-going habits are usually season-dependent.

Additionally, some limitations of the data could be overcome to improve the accuracy and relevance of the results, for example by only counting cast and director Facebook likes at the time the movie was released or proposed (depending on the stage of the process at which the model would be used in practice). The data-scraping process for the country of origin could also be improved to refer to where the movie was produced, as well as, or instead of, the current data, which relates more to where it was shot. As is usual in predictive modelling, a higher number of input samples would also likely improve the accuracy of the model; in this case, if we want to forecast global movie revenues, our dataset would be improved by a wider set of movies from countries other than the USA.

Similarly, more complex analysis could be done, including natural language processing and semantic analysis of the script, for example sentiment analysis, or additional text features such as the frequency of certain words, the number of words per minute, and the proportion of male to female lines. Cluster analysis could also be performed on movies to see whether there are similarities that help to determine gross revenue (potentially by considering groups of actors and directors that regularly work with each other).